25 research outputs found
FrameRS: A Video Frame Compression Model Composed by Self supervised Video Frame Reconstructor and Key Frame Selector
In this paper, we present a frame reconstruction model, FrameRS. It consists
of a self-supervised video frame reconstructor and a key frame selector. The frame
reconstructor, FrameMAE, is developed by adapting the principles of the Masked
Autoencoder for Images (MAE) to the video context. The key frame selector,
Frame Selector, is built on a CNN architecture. Taking the high-level semantic
information from the encoder of FrameMAE as input, it predicts the key
frames at low computational cost. Integrated with our bespoke Frame Selector,
FrameMAE can effectively compress a video clip by retaining approximately 30%
of its pivotal frames. Performance-wise, our model showcases computational
efficiency and competitive accuracy, marking a notable improvement over
traditional key frame extraction algorithms. The implementation is available on
GitHub.
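The selection step above can be sketched as follows; the linear scorer, feature shape, and 30% keep ratio are illustrative stand-ins for the paper's CNN-based Frame Selector operating on FrameMAE encoder features:

```python
import numpy as np

def select_key_frames(features, keep_ratio=0.3, rng=None):
    """Keep the top-scoring ~30% of frames.

    features: (T, D) high-level per-frame features (here: random stand-ins
    for FrameMAE encoder outputs). The linear scorer is a placeholder for
    the paper's CNN-based Frame Selector.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    T, D = features.shape
    w = rng.standard_normal(D)               # hypothetical scorer weights
    scores = features @ w                    # one scalar score per frame
    k = max(1, int(np.ceil(keep_ratio * T)))
    return np.sort(np.argsort(scores)[-k:])  # key-frame indices, in order

frames = np.random.default_rng(1).standard_normal((20, 8))
idx = select_key_frames(frames)
print(len(idx))  # 6 of 20 frames retained (~30%)
```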
DDMM-Synth: A Denoising Diffusion Model for Cross-modal Medical Image Synthesis with Sparse-view Measurement Embedding
Reducing the radiation dose in computed tomography (CT) is important to
mitigate radiation-induced risks. One option is to employ a well-trained model
to compensate for incomplete information and map sparse-view measurements to
the CT reconstruction. However, reconstruction from sparsely sampled
measurements is insufficient to uniquely characterize an object in CT, and a
learned prior model may be inadequate for unencountered cases. Medical modal
translation from magnetic resonance imaging (MRI) to CT is an alternative but
may introduce incorrect information into the synthesized CT images in addition
to the fact that there exists no explicit transformation describing their
relationship. To address these issues, we propose a novel framework called the
denoising diffusion model for medical image synthesis (DDMM-Synth) to close the
performance gaps described above. This framework combines an MRI-guided
diffusion model with a new CT measurement embedding reverse sampling scheme.
Specifically, the null-space content of the one-step denoising result is
refined by the MRI-guided data distribution prior, and its range-space
component derived from an explicit operator matrix and the sparse-view CT
measurements is directly integrated into the inference stage. DDMM-Synth can
adjust the number of CT projections a posteriori for a particular clinical
application, and its modified version can even improve the results significantly
for noisy cases. Our results show that DDMM-Synth outperforms other
state-of-the-art supervised-learning-based baselines under fair experimental
conditions.
Comment: llncs.cls v2.20, 12 pages with 6 figures
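The range/null-space refinement in the sampling scheme can be illustrated with a toy linear operator: the range-space component comes from the measurements via the pseudo-inverse, while the null-space component is kept from the denoised estimate. The matrices below are random stand-ins, not a real CT projector or denoiser output:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 8))   # toy sparse-view measurement operator
x_true = rng.standard_normal(8)
y = A @ x_true                    # sparse-view measurements

x_hat = rng.standard_normal(8)    # stand-in for the one-step denoised estimate
A_pinv = np.linalg.pinv(A)

# Range-space component comes directly from the measurements; the
# null-space component is kept from the (MRI-guided) denoised estimate.
x_refined = A_pinv @ y + (np.eye(8) - A_pinv @ A) @ x_hat

print(np.allclose(A @ x_refined, y))  # refined estimate reproduces the measurements
```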
Missing Modality meets Meta Sampling (M3S): An Efficient Universal Approach for Multimodal Sentiment Analysis with Missing Modality
Multimodal sentiment analysis (MSA) is an important way of observing mental
activities with the help of data captured from multiple modalities. However,
due to recording or transmission errors, some modalities may contain
incomplete data. Most existing works that address missing modalities usually
assume a particular modality is completely missing and seldom consider a
mixture of missing across multiple modalities. In this paper, we propose a
simple yet effective meta-sampling approach for multimodal sentiment analysis
with missing modalities, namely Missing Modality-based Meta Sampling (M3S). To
be specific, M3S formulates a missing-modality sampling strategy within the
model-agnostic meta-learning (MAML) framework. M3S can be treated as an efficient
add-on training component for existing models that significantly improves their
performance on multimodal data with a mixture of missing modalities. We
conduct experiments on the IEMOCAP, SIMS, and CMU-MOSI datasets, and superior
performance is achieved compared with recent state-of-the-art methods.
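One way to read the meta-sampling idea: each meta-task draws its own missing-modality pattern, so training covers mixtures of missing modalities rather than a single fully-absent one. A minimal sketch, assuming three modalities and an illustrative independent drop probability:

```python
import random

MODALITIES = ("text", "audio", "video")

def sample_missing_pattern(p_missing=0.3, rng=None):
    """Draw a per-modality availability mask for one meta-task.

    Each modality is dropped independently with probability p_missing,
    resampling so that at least one modality is always available. The
    modality set and drop probability are illustrative assumptions.
    """
    rng = rng or random.Random(0)
    while True:
        mask = {m: rng.random() >= p_missing for m in MODALITIES}
        if any(mask.values()):
            return mask

pattern = sample_missing_pattern()
print(pattern)
```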
MAP-SNN: Mapping Spike Activities with Multiplicity, Adaptability, and Plasticity into Bio-Plausible Spiking Neural Networks
The Spiking Neural Network (SNN) is considered more biologically realistic and
power-efficient as it imitates the fundamental mechanism of the human brain.
Recently, backpropagation (BP) based SNN learning algorithms that utilize deep
learning frameworks have achieved good performance. However,
bio-interpretability is partially neglected in those BP-based algorithms.
Toward bio-plausible BP-based SNNs, we consider three properties in modeling
spike activities: Multiplicity, Adaptability, and Plasticity (MAP). In terms of
multiplicity, we propose a Multiple-Spike Pattern (MSP) with multiple spike
transmission to strengthen model robustness in discrete time-iteration. To
realize adaptability, we adopt Spike Frequency Adaption (SFA) under MSP to
decrease spike activities for improved efficiency. For plasticity, we propose a
trainable convolutional synapse that models spike response current to enhance
the diversity of spiking neurons for temporal feature extraction. The proposed
SNN model achieves competitive performance on the neuromorphic datasets N-MNIST
and SHD. Furthermore, experimental results demonstrate that the proposed three
aspects are significant to iterative robustness, spike efficiency, and temporal
feature extraction capability of spike activities. In summary, this work
proposes a feasible scheme for bio-inspired spike activities with MAP, offering
a new neuromorphic perspective to embed biological characteristics into spiking
neural networks.
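The adaptability property can be illustrated with a toy discrete-time neuron: spike frequency adaptation raises the effective threshold after each spike, thinning out spike activity under sustained input. All constants below are illustrative, not the paper's model:

```python
def lif_sfa(inputs, tau_m=0.9, tau_a=0.95, v_th=1.0, beta=0.5):
    """Leaky integrate-and-fire neuron with spike-frequency adaptation.

    After each spike the adaptation variable `a` rises, raising the
    effective threshold and suppressing later spikes. A toy discrete-time
    sketch of the SFA idea; all constants are illustrative.
    """
    v, a, spikes = 0.0, 0.0, []
    for x in inputs:
        v = tau_m * v + x          # leaky membrane integration
        if v >= v_th + a:          # adaptive effective threshold
            spikes.append(1)
            v = 0.0                # hard reset after a spike
            a += beta              # adaptation increment
        else:
            spikes.append(0)
        a *= tau_a                 # adaptation decay
    return spikes

out = lif_sfa([0.6] * 30)
print(sum(out))  # adaptation spaces spikes out under constant drive
```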
Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation
When applying a pre-trained 2D-to-3D human pose lifting model to a target
unseen dataset, large performance degradation is commonly encountered due to
domain shift issues. We observe that the degradation is caused by two factors:
1) the large distribution gap over global positions of poses between the source
and target datasets due to variant camera parameters and settings, and 2) the
deficient diversity of local structures of poses in training. To this end, we
combine \textbf{global adaptation} and \textbf{local generalization} in
\textit{PoseDA}, a simple yet effective framework of unsupervised domain
adaptation for 3D human pose estimation. Specifically, global adaptation aims
to align global positions of poses from the source domain to the target domain
with a proposed global position alignment (GPA) module, and local
generalization is designed to enhance the diversity of 2D-3D pose mapping with
a local pose augmentation (LPA) module. These modules bring significant
performance improvements without introducing additional learnable parameters.
LPA enhances the diversity of 3D poses through an adversarial training scheme
consisting of 1) an augmentation generator that generates the parameters of
pre-defined pose transformations and 2) an anchor discriminator that ensures
the realism and quality of the augmented data. Our approach is applicable to almost all
2D-3D lifting models. \textit{PoseDA} achieves 61.3 mm of MPJPE on MPI-INF-3DHP
under a cross-dataset evaluation setup, improving upon the previous
state-of-the-art method by 10.2\%.
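A moment-matching toy version of the global-position idea: shift and scale source root-joint positions to the target distribution's statistics. This is a simplified stand-in for the GPA module, with synthetic positions in place of real pose data:

```python
import numpy as np

def align_global_positions(src_roots, tgt_roots):
    """Match source root-position statistics to the target domain.

    A minimal moment-matching stand-in for the GPA module, which aligns
    the global position distributions of the two domains.
    src_roots, tgt_roots: (N, 3) camera-space root joint positions.
    """
    src_mu, src_sigma = src_roots.mean(0), src_roots.std(0) + 1e-8
    tgt_mu, tgt_sigma = tgt_roots.mean(0), tgt_roots.std(0)
    return (src_roots - src_mu) / src_sigma * tgt_sigma + tgt_mu

rng = np.random.default_rng(0)
src = rng.normal([0, 0, 3], 0.5, (100, 3))   # synthetic source roots
tgt = rng.normal([0, 1, 5], 1.0, (100, 3))   # synthetic target roots
aligned = align_global_positions(src, tgt)
print(np.allclose(aligned.mean(0), tgt.mean(0)))  # means now coincide
```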
Blind Inpainting with Object-aware Discrimination for Artificial Marker Removal
Medical images often contain artificial markers added by doctors, which can
negatively affect the accuracy of AI-based diagnosis. To address this issue and
recover the missing visual contents, inpainting techniques are needed.
However, existing inpainting methods require manual mask input, limiting their
application scenarios. In this paper, we introduce a novel blind inpainting
method that automatically completes visual contents without specifying masks
for target areas in an image. Our proposed model includes a mask-free
reconstruction network and an object-aware discriminator. The reconstruction
network consists of two branches that predict the corrupted regions with
artificial markers and simultaneously recover the missing visual contents. The
object-aware discriminator relies on the powerful recognition capabilities of
the dense object detector to ensure that the markers of reconstructed images
cannot be detected in any local region. As a result, the reconstructed image
can be as close to the clean one as possible. Our proposed method is
evaluated on different medical image datasets, covering multiple imaging
modalities such as ultrasound (US), magnetic resonance imaging (MRI), and
electron microscopy (EM), demonstrating that our method is effective and robust
against various unknown missing region patterns.
SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation
As an important and challenging problem in computer vision, PAnoramic
Semantic Segmentation (PASS) gives complete scene perception based on an
ultra-wide angle of view. Usually, prevalent PASS methods with 2D panoramic
image input focus on solving image distortions but lack consideration of the 3D
properties of the original data. Consequently, their performance degrades
substantially when the input panoramic images contain 3D disturbances. To be more
robust to 3D disturbance, we propose our Spherical Geometry-Aware Transformer
for PAnoramic Semantic Segmentation (SGAT4PASS), considering 3D spherical
geometry knowledge. Specifically, a spherical geometry-aware framework is
proposed for PASS. It includes three modules, i.e., spherical geometry-aware
image projection, spherical deformable patch embedding, and a panorama-aware
loss, which take input images with 3D disturbance into account, add a
spherical geometry-aware constraint to the existing deformable patch embedding,
and reflect the pixel density of the original data, respectively.
Experimental results on the Stanford2D3D Panoramic dataset show that SGAT4PASS
significantly improves performance and robustness, with approximately a 2%
increase in mIoU, and when small 3D disturbances occur in the data, the
stability of our performance is improved by an order of magnitude. Our code and
supplementary material are available at
https://github.com/TencentARC/SGAT4PASS
Comment: Accepted by IJCAI 2023
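The pixel-density aspect of a panorama-aware loss can be sketched for an equirectangular image: rows near the poles cover far less solid angle on the sphere than equatorial rows, so their loss contribution can be down-weighted by the sine of the polar angle. This is a sketch of the idea, not the paper's exact loss:

```python
import numpy as np

def panorama_pixel_weights(height):
    """Per-row loss weights for an equirectangular panorama.

    Rows near the poles cover far less solid angle on the sphere than
    rows near the equator, so a panorama-aware loss can down-weight them
    by sin(theta). Illustrative sketch of the pixel-density idea.
    """
    theta = (np.arange(height) + 0.5) / height * np.pi  # polar angle per row
    return np.sin(theta)

w = panorama_pixel_weights(512)
print(round(float(w[0]), 4), round(float(w[256]), 4))  # poles ~0, equator ~1
```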
DiffFashion: Reference-based Fashion Design with Structure-aware Transfer by Diffusion Models
Image-based fashion design with AI techniques has attracted increasing
attention in recent years. We focus on a new fashion design task, where we aim
to transfer a reference appearance image onto a clothing image while preserving
the structure of the clothing image. It is a challenging task since there are
no reference images available for the newly designed output fashion images.
Although diffusion-based image translation or neural style transfer (NST) has
enabled flexible style transfer, it is often difficult to maintain the original
structure of the image realistically during the reverse diffusion, especially
when the referenced appearance image greatly differs from the common clothing
appearance. To tackle this issue, we present a novel diffusion model-based
unsupervised structure-aware transfer method to semantically generate new
clothes from a given clothing image and a reference appearance image.
Specifically, we decouple the foreground clothing using semantic masks
automatically generated from conditioned labels. The mask is further used as
guidance in the denoising process to preserve the structural information. Moreover, we
use the pre-trained vision Transformer (ViT) for both appearance and structure
guidance. Our experimental results show that the proposed method outperforms
state-of-the-art baseline models, generating more realistic images in the
fashion design task. Code and demo can be found at
https://github.com/Rem105-210/DiffFashion
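The mask-guided denoising step can be sketched as a per-pixel blend: inside the automatically generated clothing mask the newly generated appearance is kept, while outside it content is pinned to the noised reference so the original structure survives denoising. The binary mask and arrays below are synthetic placeholders:

```python
import numpy as np

def mask_guided_step(x_gen, x_ref_noisy, mask):
    """Blend one reverse-diffusion step with mask guidance.

    Inside the clothing mask, the newly generated appearance is kept;
    outside it, content is pinned to the (noised) reference image so the
    original structure is preserved. A simplified per-pixel sketch.
    """
    return mask * x_gen + (1 - mask) * x_ref_noisy

rng = np.random.default_rng(0)
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                      # hypothetical clothing mask
x_gen = rng.standard_normal((4, 4))       # stand-in generated sample
x_ref = rng.standard_normal((4, 4))       # stand-in noised reference
out = mask_guided_step(x_gen, x_ref, mask)
print(np.allclose(out[0], x_ref[0]))  # unmasked rows follow the reference
```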
Devil in the Number: Towards Robust Multi-modality Data Filter
In order to filter web-scale multi-modality datasets appropriately, it is
crucial to employ suitable filtering methods that boost performance and
reduce training costs. For instance, the LAION papers employ a CLIP score filter
to select data with CLIP scores surpassing a certain threshold. On the other
hand, T-MARS achieves high-quality data filtering by detecting and masking text
within images and then filtering by CLIP score. Through analyzing the dataset,
we observe a significant proportion of redundant information, such as numbers,
present in the textual content. Our experiments on a subset of the data unveil
the profound impact of these redundant elements on the CLIP scores. A logical
approach would involve reevaluating the CLIP scores after eliminating these
influences. Experimentally, our text-based CLIP filter outperforms the
top-ranked method on the ``small scale'' of DataComp (a data filtering
benchmark) on ImageNet distribution shifts, achieving a 3.6% performance
improvement. The results also demonstrate that our proposed text-masked filter
outperforms the original CLIP score filter when selecting the top 40% of the
data. The impact of numbers on CLIP and their handling provide valuable
insights for improving the effectiveness of CLIP training, including language
rewrite techniques.
Comment: ICCV 2023 Workshop: TNGCV-DataComp
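The number-masking idea can be sketched as a caption preprocessing step before re-scoring with CLIP; the regex below is an illustrative masking rule, not the paper's exact filter:

```python
import re

def strip_numbers(caption):
    """Mask numeric tokens in a caption before CLIP re-scoring.

    The abstract's observation: redundant numbers in web captions distort
    CLIP scores, so scores are recomputed with numbers removed. This
    regex is an illustrative masking rule, not the paper's exact filter.
    """
    return re.sub(r"\s*\b\d+([.,]\d+)*\b", "", caption).strip()

print(strip_numbers("blue dress size 42, 19.99 USD"))  # blue dress size, USD
```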